06_2_A_ESmoothing_Trended

Libraries

library(fpp3)
── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
✔ tibble      3.2.1     ✔ tsibble     1.1.3
✔ dplyr       1.1.2     ✔ tsibbledata 0.4.1
✔ tidyr       1.3.0     ✔ feasts      0.3.1
✔ lubridate   1.9.2     ✔ fable       0.3.3
✔ ggplot2     3.4.3     ✔ fabletools  0.3.3
── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date()    masks base::date()
✖ dplyr::filter()      masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval()  masks lubridate::interval()
✖ dplyr::lag()         masks stats::lag()
✖ tsibble::setdiff()   masks base::setdiff()
✖ tsibble::union()     masks base::union()
library(patchwork)

References

  1. Hyndman, R.J., & Athanasopoulos, G. (2021). Forecasting: principles and practice (3rd ed.). https://otexts.com/fpp3/

  2. Fable package documentation.

  3. Hyndman, R. J., Koehler, A. B., Ord, J. K., & Snyder, R. D. (2008). Forecasting with exponential smoothing: The state space approach. Springer-Verlag.

Note

This notebook is not original material, but rather based on ref. 1 with notes expanding on some of the concepts contained in the book.

Motivation

So far we have seen Simple Exponential Smoothing, a forecasting model that lies between the Mean and the Naive models:

  • The naive model assigns a constant value to all future forecasts: the last observation. It places all the weight on the most recent observation.

  • The mean model assigns a constant value to all future forecasts, with equal weight for each observation regardless of when it occurred.

  • Simple Exponential Smoothing also assigns a constant value to all future forecasts, but this value is a weighted average that assigns the largest weights to the most recent observations. The further in the past an observation lies, the smaller its assigned weight.
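The geometric decay of the SES weights can be checked numerically; \(\alpha = 0.3\) below is an arbitrary illustrative choice:

```r
# Weight assigned by SES to the observation j steps in the past:
# alpha * (1 - alpha)^j. The weights decay geometrically with j.
alpha <- 0.3
weights <- alpha * (1 - alpha)^(0:19)

weights[1:3]   # largest weights on the most recent observations
sum(weights)   # the weights sum to (almost) 1
```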

We now want to introduce trend and seasonality into exponential smoothing.

Let us first recall the component form of simple exponential smoothing:

\[\begin{align*} \text{Forecast equation} && \hat{y}_{t+h|t} & = \ell_{t}\\ \text{Smoothing equation} && \ell_{t} & = \alpha y_{t} + (1 - \alpha)\ell_{t-1}, \end{align*}\]
  • \(l_t\) is the level (or the smoothed value) of the series at time \(t\)

  • \(h = 1\) gives the fitted values

  • \(t = T\) gives the true forecasts beyond the training data

Forecast equation

  • The forecast at t+1 is the estimated level at time t.

Smoothing equation

  • The smoothing equation gives the estimated level of the series at each period t.

Flat forecasts

  • SES has a flat forecast function: all forecasts take the same value, equal to the last level component.
\[\begin{align*} \hat{y}_{T+h|T} = \hat{y}_{T+1|T}= l_T \qquad h=2,3,\dots. \end{align*}\]
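A minimal base-R sketch (not fable's implementation) makes the flat forecast function explicit: after the level has been updated through the whole sample, every forecast simply repeats the final level. The data and \(\alpha = 0.5\) below are made up for illustration.

```r
# SES smoothing equation applied recursively; returns the final level l_T
ses_last_level <- function(y, alpha, l0) {
  l <- l0
  for (yt in y) l <- alpha * yt + (1 - alpha) * l
  l
}

y   <- c(10, 12, 11, 13, 12)
l_T <- ses_last_level(y, alpha = 0.5, l0 = 10)

# Flat forecasts: y_hat_{T+h|T} = l_T for every h
forecasts <- rep(l_T, 5)
```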

Holt’s Linear Trend Method

Holt extended Simple Exponential Smoothing to allow the forecasting of data with trend.

Component form of the equations

Unlike for Simple Exponential Smoothing, we will not derive the equations. We will accept them and then interpret them further down below. For those interested, see reference 3.

Differences with Simple Exponential Smoothing:

  • Introduces new parameter for the trend: \(b_t\) (analogous to the level \(l_t\)).
    • This implies introducing a new equation for the trend.
  • The trend has its own smoothing parameter \(\beta^*\) (analogous to \(\alpha\)).

\[ \begin{align*} \text{Forecast equation}&& \hat{y}_{t+h|t} &= \ell_{t} + hb_{t} \\ \text{Level equation} && \ell_{t} &= \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\ \text{Trend equation} && b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)b_{t-1} \end{align*} \]

where:

  • \(l_t\) denotes an estimate of the level of the series at time \(t\).
  • \(b_t\) denotes an estimate of the trend (slope) of the series at time \(t\).
  • \(\alpha\) is the smoothing parameter for the level, \(0 \le \alpha \le 1\).
  • \(\beta^*\) is the smoothing parameter for the trend, \(0 \le \beta^* \le 1\).
    • The reason for using \(\beta^*\) instead of \(\beta\) will be explained when talking about innovations state space models.

Equations for the fitted values

To understand how these equations help fit a time series with trend, let us particularize the forecast equation for \(h=1\) (fitted values are one-step-ahead forecasts on the training data). We then shift the time index from \(t+1\) to \(t\) to make it consistent with the level and trend equations:

\[ \begin{align*} \text{Forecast equation for h = 1} && \hat{y}_{t+1|t} &= \ell_{t} + 1 \cdot b_{t} \\ \text{Particularize for t instead t+1} && \hat{y}_{t|t-1} &= \ell_{t-1} + 1 \cdot b_{t-1} && \text{(fitted value at t)} \\ \end{align*} \]

Therefore, the equations for the fitted values are:

\[ \begin{align*} \text{Fitted value at t}&& \hat{y}_{t|t-1} &= \ell_{t-1} + b_{t-1} \\ \text{Level at t} && \ell_{t} &= \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\ \text{Trend at t} && b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)b_{t-1} \end{align*} \]

These equations allow us to determine the fitted values if given:

  1. The values of the smoothing parameters \(\alpha\), \(\beta^*\).
  2. An initial value for the level \(l_{0}\).
  3. An initial value for the trend \(b_{0}\).

The process to compute the fitted values having this data is as follows:

  • For \(t = 1\):
    1. \(\hat{y}_{1|0} = \ell_{0} + b_{0}\)
    2. \(l_{1} = \alpha y_{1} + (1 - \alpha)(l_{0} + b_{0})\)
      • The value of \(l_1\) will be used in the trend equation next.
    3. \(b_{1} = \beta^*(l_{1} - l_{0}) + (1 -\beta^*)b_{0}\)
  • For \(t = 2\): we proceed in the same manner after having computed \(l_1\) and \(b_1\)
  • For \(t = 3\): we proceed in the same manner after having computed \(l_2\) and \(b_2\)
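The recursion above can be written out in a few lines of base R (an illustrative sketch, not fable's implementation):

```r
# Holt's recursion: given alpha, beta* and the initial states l0, b0,
# compute the level, slope and one-step-ahead fitted values.
holt_fit <- function(y, alpha, beta, l0, b0) {
  n <- length(y)
  level <- slope <- fitted <- numeric(n)
  l_prev <- l0; b_prev <- b0
  for (t in seq_len(n)) {
    fitted[t] <- l_prev + b_prev                                  # y_hat_{t|t-1}
    level[t]  <- alpha * y[t] + (1 - alpha) * (l_prev + b_prev)   # level equation
    slope[t]  <- beta * (level[t] - l_prev) + (1 - beta) * b_prev # trend equation
    l_prev <- level[t]; b_prev <- slope[t]
  }
  list(level = level, slope = slope, fitted = fitted)
}
```

Note that with \(\alpha\) close to 1 the level tracks the observations almost exactly.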

The figure below helps picture the process:

Interpreting the equations

Now we are ready to make an interpretation of the equations.

Level equation

The level equation shows that \(l_t\) is a weighted average of the observation \(y_t\) and the one-step-ahead training forecast for time t. This weighted average is determined by the parameter \(\alpha\). That is, \(\alpha\) is the smoothing parameter for the level.

\[ \begin{align*} \text{Level equation} && \ell_{t} &= \alpha \underbrace{y_{t}}_{\substack{\text{observation} \\ \text{at time } t}} + (1 - \alpha)\underbrace{(\ell_{t-1} + b_{t-1})}_{\substack{\text{one-step ahead} \\ \text{forecast at } t}} \end{align*} \]

  • NOTE: we can see that \(\ell_{t-1} + b_{t-1}\) is the one-step-ahead forecast at time \(t\) by particularizing the forecast equation for \(h = 1\) and then shifting the time index from \(t+1\) to \(t\):

\[ \begin{align*} \text{Forecast equation for h = 1} && \hat{y}_{t+1|t} &= \ell_{t} + 1 \cdot b_{t} \\ \text{Particularize for t instead t+1} && \hat{y}_{t|t-1} &= \ell_{t-1} + 1 \cdot b_{t-1} \\ \end{align*} \]

  • As you can see, this shows that \(\ell_{t-1} + b_{t-1}\) is the one-step-ahead forecast at time \(t\).

Trend equation

The trend equation shows that \(b_t\) is a weighted average of the estimated trend at time t and the previous estimate of the trend. This weighted average is determined by the parameter \(\beta^*\)

\[ \begin{align*} \text{Trend equation} && b_{t} &= \beta^*\underbrace{(\ell_{t} - \ell_{t-1})}_{\substack{\text{estimated trend} \\ \text{at time } t}} + (1 -\beta^*)\underbrace{b_{t-1}}_{\substack{\text{previous} \\ \text{estimate} \\ \text{of the trend}}} \end{align*} \]

  • NOTE: \(\ell_{t} - \ell_{t-1}\) is an estimate of the trend at time \(t\) in terms of \(l_t\) and \(l_{t-1}\); that is, it uses only information up to time \(t\) (not beyond).
    • If we take the level \(l_t\) as an estimate of the value of the time series, then the difference between consecutive levels divided by 1 (the distance in timesteps between \(l_t\) and \(l_{t-1}\)) is an estimate of the slope at time \(t\). The figure below should clarify this:

Equation for the fitted values

The fitted value at time \(t\) is simply the previous level plus the previous slope: starting at \(l_{t-1}\) and advancing one timestep with slope \(b_{t-1}\), we obtain the fitted value at \(t\).

\[ \begin{align*} \text{Fitted value at t}&& \hat{y}_{t|t-1} &= \ell_{t-1} + 1 \cdot b_{t-1} \\ \end{align*} \]

Equation for the forecasts

Once the fitted values \(\hat{y}_t\), the levels \(l_t\) and the slopes \(b_t\) have been computed for every value of \(t\) in the training region (from \(t=1\) to \(t=T\)), the forecasts are given by:

\[ \begin{align*} \text{Forecast equation}&& \hat{y}_{T+h|T} &= \ell_{T} + hb_{T} \end{align*} \]

By design, the forecast function is no longer flat but trends with a constant slope \(b_T\) (the last estimate of the trend).

  • The h-step-ahead forecast is equal to the last estimated level \(l_T\) plus \(h\) times the last estimated trend value \(b_T\). As such, the forecasts are a linear function of \(h\).

Effect of the value of \(\beta^*\)

A value of \(\beta^*\) that is too high results in a trend component that is too responsive, leading to fitted values that follow the data too closely. Following every small fluctuation in the data leads to overfitting.

The following graph shows a time series (in red) and the corresponding fitted values for \(\alpha=0.4\) and \(\beta^*\) ranging from 0 to 1.

Fitting process

Fitting Holt’s method to a time series means finding the values of the smoothing parameters and the initial states that minimize the Sum of Squared Residuals (SSE). In this case, we are not going to implement this from scratch as we did for simple exponential smoothing; an understanding of the process is sufficient.

As an example, the table below shows steps 1 and 2 assuming the estimated values \(\hat{\alpha} = 0.9999\), \(\hat{\beta}^* = 0.3267\), \(l_0 = 10.05\) and \(b_0 = 0.22\). The table also extends ten steps beyond the training data, generating the corresponding forecasts based on the latest fitted values of the level, the slope and the smoothing parameters:

Year Time Observation Level Slope Forecast
\(t\) \(y_t\) \(l_t\) \(b_t\) \(\hat{y}_{t+1|t}\)
1959 0 10.05 0.22
1960 1 10.28 10.28 0.22 10.28
1961 2 10.48 10.48 0.22 10.50
1962 3 10.74 10.74 0.23 10.70
1963 4 10.95 10.95 0.22 10.97
1964 5 11.17 11.17 0.22 11.17
1965 6 11.39 11.39 0.22 11.39
1966 7 11.65 11.65 0.23 11.61
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
2014 55 23.50 23.50 0.37 23.52
2015 56 23.85 23.85 0.36 23.87
2016 57 24.21 24.21 0.36 24.21
2017 58 24.60 24.60 0.37 24.57
\(h\) \(\hat{y}_{T+h|T}\)
2018 1 24.97
2019 2 25.34
2020 3 25.71
2021 4 26.07
2022 5 26.44
2023 6 26.81
2024 7 27.18
2025 8 27.55
2026 9 27.92
2027 10 28.29
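The forecast rows can be reproduced directly from the forecast equation using the last level and slope read off the table (since \(l_T = 24.60\) and \(b_T = 0.37\) are rounded values, the last digit may differ slightly from the table):

```r
l_T <- 24.60  # last estimated level (2017)
b_T <- 0.37   # last estimated slope (2017)
h   <- 1:10

fc <- l_T + h * b_T   # y_hat_{T+h|T} = l_T + h * b_T
round(fc, 2)          # 24.97, 25.34, 25.71, ...
```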

Exercise: reproduce this table in Excel (template given by the professor).

R Example: Australian Population

Australia’s population from 1960 to 2017.

aus_economy <- global_economy %>%
  filter(Code == "AUS") %>%
  mutate(Pop = Population / 1e6)
autoplot(aus_economy, Pop) +
  labs(y = "Millions", title = "Australian population")

Clearly, this is trended data. Trended exponential smoothing is therefore sensible.

The fable library performs all the fitting automatically using the following commands:

fit <- aus_economy %>%
  model(
    AAN = ETS(Pop ~ error("A") + trend("A") + season("N"))
  )

Remember that the output is a mable, or model table: a table object containing the model objects we fitted (in this case, only one model).

fit
# A mable: 1 x 2
# Key:     Country [1]
  Country            AAN
  <fct>          <model>
1 Australia <ETS(A,A,N)>

As always, we can check the values of the parameters using the function tidy():

tidy(fit)
# A tibble: 4 × 4
  Country   .model term  estimate
  <fct>     <chr>  <chr>    <dbl>
1 Australia AAN    alpha    1.00 
2 Australia AAN    beta     0.327
3 Australia AAN    l[0]    10.1  
4 Australia AAN    b[0]     0.222

We may also extract the fitted values and other information using augment(). In the example below, I also compute the sum of squared residuals:

fitted_vals <- 
  fit %>% 
  augment()

resid <-  fitted_vals$.innov

# Sum of squared residuals
SSE <- sum(resid^2)
SSE
[1] 0.2231852

We may generate forecasts as usual using forecast() on the fitted object:

fc <- 
  fit %>% 
  forecast(h = 10)

fc %>% autoplot(aus_economy, level = NULL) +
  labs(title = "Australian population",
       y = "Millions") +
  guides(colour = guide_legend(title = "Forecast"))

As expected, the h-step ahead forecast is equal to the last estimated level plus h times the last estimated trend value. As such, the forecasts are a linear function of h.

Components of the model

The way in which exponential smoothing proceeds may be considered a decomposition of the time series into multiple components: in the trended case, the level \(l_t\) and the slope \(b_t\).

We may use the function components() on an ETS model to retrieve the values of these components in the training region. In the output below, note that we get an additional year, 1959, corresponding to \(t=0\), where we have \(l_0\) and \(b_0\).

holts_components <- 
  fit %>% 
    components()

holts_components
# A dable: 59 x 7 [1Y]
# Key:     Country, .model [1]
# :        Pop = lag(level, 1) + lag(slope, 1) + remainder
   Country   .model  Year   Pop level slope remainder
   <fct>     <chr>  <dbl> <dbl> <dbl> <dbl>     <dbl>
 1 Australia AAN     1959  NA    10.1 0.222 NA       
 2 Australia AAN     1960  10.3  10.3 0.222 -0.000145
 3 Australia AAN     1961  10.5  10.5 0.217 -0.0159  
 4 Australia AAN     1962  10.7  10.7 0.231  0.0418  
 5 Australia AAN     1963  11.0  11.0 0.223 -0.0229  
 6 Australia AAN     1964  11.2  11.2 0.221 -0.00641 
 7 Australia AAN     1965  11.4  11.4 0.221 -0.000314
 8 Australia AAN     1966  11.7  11.7 0.235  0.0418  
 9 Australia AAN     1967  11.8  11.8 0.206 -0.0869  
10 Australia AAN     1968  12.0  12.0 0.208  0.00350 
# ℹ 49 more rows
# Depict the components
holts_components %>% autoplot()
Warning: Removed 1 row containing missing values (`geom_line()`).

Just as we did in the separate Excel files, the fitted values may be computed from the level and the slope returned by components(), using the forecast equation particularized at \(h=1\) (as derived previously in this notebook):

\[ \begin{align*} \text{Fitted value at t}&& \hat{y}_{t|t-1} &= \ell_{t-1} + b_{t-1} \end{align*} \]

# Use the equation for the fitted values
holts_components <- 
  holts_components %>% 
  mutate(
    .fitted = lag(level) + lag(slope) # Equation for the fitted vals
  )

We may check that these are the same fitted values returned when using augment() on the model:

# Vector of fitted values computed using augment
fitted_augment <- 
  fit %>% 
    augment() %>% 
    pull(.fitted)

# Vector of fitted values computed using the components() and the equation
fitted_components <- 
  holts_components %>% 
  pull(.fitted)

# Remove NAs
fitted_components <-  fitted_components[!is.na(fitted_components)]

all.equal(fitted_augment, fitted_components)
[1] TRUE

Damped trend methods

  • Holt’s linear method \(\rightarrow\) constant trend
    • Problematic for longer forecast horizons (forecasts too high)
  • Damping parameter \(\phi\) \(\rightarrow\) dampens the trend line to a flat line at some point in the future.
    • Very popular method, proven to be quite successful

    • Additional parameter on top of those included in Holt’s method \(0<\phi<1\)

    • For \(\phi = 1\) the method is equivalent to Holt’s method

    • For \(0 < \phi < 1\) the trend is damped and the forecasts approach a constant value some time into the future. Specifically, the value is:

      • \(\ell_T+\phi b_T/(1-\phi)\) as \(h \rightarrow \infty\)

The reason for the above value is that the damping term is actually a geometric progression with ratio \(\phi\):

\[\begin{align*} \text{Forecast equation damped method} && \hat{y}_{t+h|t} &= \ell_{t} + (\phi+\phi^2 + \dots + \phi^{h})b_{t} \\ \text{Forecast equation Holt's method} && \hat{y}_{t+h|t} &= \ell_{t} + hb_{t} \\ \end{align*}\]

As you can see the \(h\) factor has been replaced by the following geometric progression:

\[\begin{align*} (\phi+\phi^2 + \dots + \phi^{h}) \end{align*}\]

Since the ratio \(\phi\) is smaller than 1, the geometric progression converges: its sum approaches \(\phi/(1-\phi)\), so the forecasts approach \(\ell_T+\phi b_T/(1-\phi)\) as \(h \rightarrow \infty\).
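This convergence is easy to verify numerically; \(\phi = 0.9\) below is an arbitrary illustrative value:

```r
# Damping factor (phi + phi^2 + ... + phi^h) for increasing h
phi <- 0.9
damp_factor <- function(h) sum(phi^(1:h))

damp_factor(5)     # still far from the limit
damp_factor(200)   # essentially phi / (1 - phi) = 9
```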

Usual values for \(\phi\)

  • Rarely less than 0.8 (very strong effect for smaller values)
  • Rarely greater than 0.98 (otherwise very similar to undamped model)

Damped example 1: specify a value for \(\phi\)

fit_damped_1 <- 
  aus_economy %>%
    model(
      holts = ETS(Pop ~ error("A") + trend("A") + season("N")),
    
    # Damping parameter explicitly set to 0.9
     damped_holts = ETS(Pop ~ error("A") + trend("Ad", phi = 0.9) + season("N"))
  )

fc_damped_1 <-  
  fit_damped_1 %>% 
  forecast(h = 15) 

fc_damped_1 %>%
  autoplot(aus_economy, level = NULL) +
  labs(title = "Australian population",
       y = "Millions") +
  guides(colour = guide_legend(title = "Forecast"))

Let us re-format that output to build a table comparing the forecasts of both models:

fc_damped_1 %>%
  as_tibble() %>%
  select(Year, .model, .mean) %>% 
  pivot_wider(values_from = c(`.mean`), names_from = c(`.model`))
# A tibble: 15 × 3
    Year holts damped_holts
   <dbl> <dbl>        <dbl>
 1  2018  25.0         24.9
 2  2019  25.3         25.2
 3  2020  25.7         25.5
 4  2021  26.1         25.8
 5  2022  26.4         26.0
 6  2023  26.8         26.2
 7  2024  27.2         26.4
 8  2025  27.6         26.5
 9  2026  27.9         26.7
10  2027  28.3         26.8
11  2028  28.7         26.9
12  2029  29.0         27.0
13  2030  29.4         27.1
14  2031  29.8         27.2
15  2032  30.1         27.3

You can see that the damped method returns a lower forecast for every point, as expected.

Damped example 2: specify a range of possible values for \(\phi\) and let ETS() pick the best

Instead of fixing phi, we can specify a range of admissible values and let the fable library estimate the best value within that range:

fit_damped_2 <-
  aus_economy %>%
      model(
        holts = ETS(Pop ~ error("A") + trend("A") + season("N")),
        
        # Restrict phi to the range c(0.8, 0.95)
        damped_holts = ETS(Pop ~ error("A") + 
                                 trend("Ad", phi = NULL, phi_range=c(0.8, 0.95)) +
                                 season("N")
                      )
    )
tidy(fit_damped_2)
# A tibble: 9 × 4
  Country   .model       term  estimate
  <fct>     <chr>        <chr>    <dbl>
1 Australia holts        alpha    1.00 
2 Australia holts        beta     0.327
3 Australia holts        l[0]    10.1  
4 Australia holts        b[0]     0.222
5 Australia damped_holts alpha    0.533
6 Australia damped_holts beta     0.533
7 Australia damped_holts phi      0.950
8 Australia damped_holts l[0]     9.96 
9 Australia damped_holts b[0]     0.207
fc_damped_2 <- 
  fit_damped_2 %>% 
  forecast(h = 15)

fc_damped_2 %>%
  autoplot(aus_economy, level = NULL) +
  labs(title = "Australian population",
       y = "Millions") +
  guides(colour = guide_legend(title = "Forecast"))

Damped example 3: model without specifying \(\phi\), letting ETS() pick the best value.

Again, the best value picked will be the one that minimizes the SSE.

fit_damped_3 <-
  aus_economy %>%
      model(
        holts = ETS(Pop ~ error("A") + trend("A") + season("N")),
        damped_holts = ETS(Pop ~ error("A") + trend("Ad") + season("N"))
      )
tidy(fit_damped_3)
# A tibble: 9 × 4
  Country   .model       term  estimate
  <fct>     <chr>        <chr>    <dbl>
1 Australia holts        alpha    1.00 
2 Australia holts        beta     0.327
3 Australia holts        l[0]    10.1  
4 Australia holts        b[0]     0.222
5 Australia damped_holts alpha    0.999
6 Australia damped_holts beta     0.427
7 Australia damped_holts phi      0.980
8 Australia damped_holts l[0]    10.0  
9 Australia damped_holts b[0]     0.248
fc_damped_3 <- 
  fit_damped_3 %>% 
  forecast(h = 15)

fc_damped_3 %>%
  autoplot(aus_economy, level = NULL) +
  labs(title = "Australian population",
       y = "Millions") +
  guides(colour = guide_legend(title = "Forecast"))

Exercise 1: internet usage

Number of users connected to the internet via a server, observed over 100 minutes. Type ?WWWusage to find out more details about the dataset (loaded along with the fpp3 package).

# Convert time series object to tsibble object for compatibility with fable methods and tools
www_usage <- as_tsibble(WWWusage)

# Plot the data
www_usage %>% 
  autoplot(value) +
  labs(x="Minute", y="Number of users",
       title = "Internet usage per minute")

1. Fit models

Fit the following models:

  1. A simple exponential smoothing model
  2. Holt’s additive model (exponential smoothing with additive trend)
  3. A damped Holt’s model (exponential smoothing with additive damped trend)

2. Briefly compare residuals of Holt’s model and damped Holt’s model

3. Perform forecasts

Use the three models and depict the forecasts. Create two separate graphs:

  • Graph 1: forecasts of all models along with the historical data. Do not print prediction intervals.
  • Graph 2: forecasts of the damped model with its prediction intervals.

Exercise 2: Internet usage

This exercise is somewhat of a repetition of exercise 1, only this time using cross-validation.

Fit the following models:

  1. A simple exponential smoothing model
  2. Holt’s additive model (exponential smoothing with additive trend)
  3. A damped Holt’s model (exponential smoothing with additive damped trend)

# Convert time series object to tsibble object for compatibility with fable methods and tools
www_usage <- as_tsibble(WWWusage)

# Plot the data
www_usage %>% 
  autoplot(value) +
  labs(x="Minute", y="Number of users",
       title = "Internet usage per minute")

Use cross-validation to assess which method works best for the data at hand.

  • Consider forecasts up to a horizon of 10.
  • The smallest training dataset shall cover 80% of the observations.
  • Compute the accuracy metrics both for each forecast horizon and model as well as for each model averaged over all forecast horizons.

When you are done, take your best model on average over all forecast horizons and:

  • Assess its residuals
  • Forecast 10 timesteps ahead and depict the forecasts along with the prediction intervals and the historical data.

Step 1: create training datasets

Create the different datasets used for cross-validation using stretch_tsibble(). The smallest dataset should cover 80% of the observations, as specified above.

Step 2: fit models to training datasets

Fit the models to be compared to the datasets used for cross-validation:

Step 3: Perform forecasts and compute accuracy metrics